analysis
The groceries.txt file contains a total of 9,835 unique
shopping baskets. We first did some data wrangling
before conducting Market Basket Analysis using the “arules” package. As
for the thresholds, we chose a support of .001, a confidence of .5, and a
maxlen of 10. A relatively low support of .001 was chosen because we
wanted to capture as many items as possible from the dataset. A confidence
of .5 was chosen to filter out weak associations. Lastly, we limited the
maximum number of items per itemset to 10 to allow for as many
grocery combinations as possible. Running the algorithm with
the above thresholds resulted in 5,668 rules, which we thought was enough
for this analysis. Below are two plots showing the resulting rules; the
first plots support against lift, while the second plots
support against confidence.
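The run described above can be sketched as follows. This is a minimal sketch rather than the original script: it assumes the `Groceries` data bundled with arules as a stand-in for the wrangled groceries.txt transactions (the wrangling step is not shown), and the object name `rules` and the `shading` choices in the plots are illustrative.

```r
# Sketch only: the bundled Groceries data stands in for the wrangled
# groceries.txt transactions, so counts may differ slightly from the log.
library(arules)
library(arulesViz)

data("Groceries", package = "arules")

# Thresholds from the text: support 0.001, confidence 0.5, maxlen 10.
rules <- apriori(
  Groceries,
  parameter = list(support = 0.001, confidence = 0.5, maxlen = 10)
)

# Number of rules that passed the thresholds.
length(rules)

# Scatter plots of the rules: support vs. lift, then support vs. confidence.
plot(rules, measure = c("support", "lift"), shading = "confidence")
plot(rules, measure = c("support", "confidence"), shading = "lift")
```

Lowering `support` or `confidence` in the `parameter` list grows the rule set quickly, so it is worth checking `length(rules)` before plotting.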
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9836 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [5668 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].


Below is a table that shows the ten rules with the highest
confidence. Confidence is the probability that a basket contains the
item(s) on the RHS given that it contains those on the LHS. Out of the
top ten rules, the most frequent RHS items are whole milk and other
vegetables. However, this does not reveal much about association. Take
{canned fish,hygiene articles} -> {whole milk} as an
example. Intuitively, buying canned fish and hygiene articles doesn’t
seem to have anything to do with buying whole milk. However, this rule
is still at the top of the list simply because whole milk is bought
more frequently than any other item, regardless of what
other items are purchased. To see more relevant association rules, let’s
look at a list with the highest lift.
Top 10 rules with the highest confidence

| LHS | RHS | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|
| {rice,sugar} | {whole milk} | 0.0012200 | 1 | 0.0012200 | 3.914047 | 12 |
| {canned fish,hygiene articles} | {whole milk} | 0.0011183 | 1 | 0.0011183 | 3.914047 | 11 |
| {butter,rice,root vegetables} | {whole milk} | 0.0010167 | 1 | 0.0010167 | 3.914047 | 10 |
| {flour,root vegetables,whipped/sour cream} | {whole milk} | 0.0017283 | 1 | 0.0017283 | 3.914047 | 17 |
| {butter,domestic eggs,soft cheese} | {whole milk} | 0.0010167 | 1 | 0.0010167 | 3.914047 | 10 |
| {citrus fruit,root vegetables,soft cheese} | {other vegetables} | 0.0010167 | 1 | 0.0010167 | 5.168681 | 10 |
| {butter,hygiene articles,pip fruit} | {whole milk} | 0.0010167 | 1 | 0.0010167 | 3.914047 | 10 |
| {hygiene articles,root vegetables,whipped/sour cream} | {whole milk} | 0.0010167 | 1 | 0.0010167 | 3.914047 | 10 |
| {hygiene articles,pip fruit,root vegetables} | {whole milk} | 0.0010167 | 1 | 0.0010167 | 3.914047 | 10 |
| {cream cheese ,domestic eggs,sugar} | {whole milk} | 0.0011183 | 1 | 0.0011183 | 3.914047 | 11 |

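As a quick check on the confidence column, the values can be recovered from the other columns. The relation below is the standard definition, read with the coverage column as the support of the LHS alone, which is how arules reports it; the symbols X (LHS) and Y (RHS) are introduced here only for illustration.

$$
\text{confidence}(X \Rightarrow Y) = \frac{\text{support}(X \cup Y)}{\text{support}(X)} = \frac{\text{support}}{\text{coverage}}
$$

For {rice,sugar} -> {whole milk} this gives

$$
\frac{0.0012200}{0.0012200} = 1,
$$

meaning that all 12 baskets containing rice and sugar also contained whole milk. Every rule in this table has support equal to coverage, so each confidence of 1 rests on only 10 to 17 baskets.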
Below is a table showing the ten rules with the highest lift.
Lift differs from confidence in that it is the ratio of
confidence to expected confidence, where the expected confidence is how
often the RHS is bought overall. In other words, lift measures the
relative strength of association between LHS and RHS. It corrects for
the high purchase frequency of whole milk that we observed above. Lift
> 1 indicates that buying the LHS raises the chance of buying the RHS,
whereas lift < 1 indicates that it lowers that chance. Lift = 1 means
that the LHS has no effect on the chance of buying the RHS. The
result here is much more interesting and informative.
{popcorn,soda} -> {salty snack}: Here, it seems like
people are getting ready for a movie night. People who buy popcorn and
soda are likely to buy salty snacks as well. Thus, the rule makes
sense.
{ham,processed cheese} -> {white bread}: These are
ingredients for a quick sandwich. Hence, this rule makes sense
as well.
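The lift figures can be tied back to the tables with a short calculation. The formula is the standard definition; X and Y are illustrative symbols for the LHS and RHS.

$$
\text{lift}(X \Rightarrow Y) = \frac{\text{confidence}(X \Rightarrow Y)}{\text{support}(Y)}
$$

Every whole milk rule in the confidence table has confidence 1 and lift 3.914047, so

$$
\text{support}(\text{whole milk}) = \frac{1}{3.914047} \approx 0.2555,
$$

that is, whole milk is in roughly a quarter of all baskets, and a rule predicting it can at most be about 3.9 times better than guessing. For {popcorn,soda} -> {salty snack} the same relation gives

$$
\text{support}(\text{salty snack}) = \frac{0.6315789}{16.69949} \approx 0.0378,
$$

so salty snacks are in fewer than 4% of baskets overall but in about 63% of baskets that contain popcorn and soda.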
Top 10 rules with the highest lift

| LHS | RHS | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|
| {Instant food products,soda} | {hamburger meat} | 0.0012200 | 0.6315789 | 0.0019317 | 18.99759 | 12 |
| {popcorn,soda} | {salty snack} | 0.0012200 | 0.6315789 | 0.0019317 | 16.69949 | 12 |
| {baking powder,flour} | {sugar} | 0.0010167 | 0.5555556 | 0.0018300 | 16.40974 | 10 |
| {ham,processed cheese} | {white bread} | 0.0019317 | 0.6333333 | 0.0030500 | 15.04702 | 19 |
| {Instant food products,whole milk} | {hamburger meat} | 0.0015250 | 0.5000000 | 0.0030500 | 15.03976 | 15 |
| {curd,other vegetables,whipped/sour cream,yogurt} | {cream cheese } | 0.0010167 | 0.5882353 | 0.0017283 | 14.83560 | 10 |
| {domestic eggs,processed cheese} | {white bread} | 0.0011183 | 0.5238095 | 0.0021350 | 12.44490 | 11 |
| {other vegetables,tropical fruit,white bread,yogurt} | {butter} | 0.0010167 | 0.6666667 | 0.0015250 | 12.03180 | 10 |
| {hamburger meat,whipped/sour cream,yogurt} | {butter} | 0.0010167 | 0.6250000 | 0.0016267 | 11.27982 | 10 |
| {domestic eggs,other vegetables,tropical fruit,whole milk,yogurt} | {butter} | 0.0010167 | 0.6250000 | 0.0016267 | 11.27982 | 10 |

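Both top-ten tables are rankings of the same rule set by a single quality measure. A minimal sketch of that ranking step is shown here, assuming the bundled `Groceries` data in place of the wrangled groceries.txt transactions; the names `rules`, `top_conf`, and `top_lift` are illustrative and not taken from the original script.

```r
# Sketch only: Groceries stands in for the wrangled groceries.txt data.
library(arules)

data("Groceries", package = "arules")
rules <- apriori(
  Groceries,
  parameter = list(support = 0.001, confidence = 0.5, maxlen = 10),
  control = list(verbose = FALSE)  # suppress the progress log
)

# sort() orders rules by a quality measure, highest first by default.
top_conf <- head(sort(rules, by = "confidence"), n = 10)
top_lift <- head(sort(rules, by = "lift"), n = 10)

# inspect() prints LHS, RHS and the quality columns for each rule.
inspect(top_conf)
inspect(top_lift)
```

Ties are common in the confidence ranking because many rules reach a confidence of 1, so which ten rules appear there can depend on the order in which tied rules were mined.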
The last plot is a graph visualization of the association
rules. Each item in the LHS is connected to the RHS item, and the
arrows indicate the direction of the relationship.
library(arulesViz)   # associations2igraph(), saveAsGraph()
library(DiagrammeR)  # grViz()

# Graph-based visualization and export.
# Associations are represented as edges: for rules, each item in the LHS
# is connected with a directed edge to the item in the RHS.
# `subset` holds the rules being visualized (defined earlier).
# saveAsGraph(subset, file = "groceriesrules.graphml")
groceries_graph <- associations2igraph(subset)
igraph::write_graph(groceries_graph, file = "groceries.graphml", format = "graphml")
grViz("groceries.graphml")